GLR*: A Robust Grammar-Focused Parser for Spontaneously Spoken Language
Abstract
The analysis of spoken language is widely considered to be a more challenging task than the analysis of written text. All of the difficulties of written language can generally be found in spoken language as well. Parsing spontaneous speech must, however, also deal with problems such as speech disfluencies, the looser notion of grammaticality, and the lack of clearly marked sentence boundaries. The contamination of the input with errors of a speech recognizer can further exacerbate these problems. Most natural language parsing algorithms are designed to analyze “clean” grammatical input. Because they reject any input that is found to be ungrammatical in even the slightest way, such parsers are unsuitable for parsing spontaneous speech, where completely grammatical input is the exception rather than the rule.
This thesis describes GLR*, a parsing system based on Tomita’s Generalized LR parsing algorithm and designed to be robust to two particular types of extra-grammaticality: noise in the input and limited grammar coverage. GLR* attempts to overcome these forms of extra-grammaticality by ignoring the unparsable words and fragments and conducting a search for the maximal subset of the original input that is covered by the grammar. The parser is coupled with a beam search heuristic that limits the combinations of skipped words considered by the parser and ensures that the parser will operate within feasible time and space bounds.
The developed parsing system includes several tools designed to address the difficulties of parsing spontaneous speech. To cope with high levels of ambiguity, we developed a statistical disambiguation module, in which probabilities are attached directly to the actions in the LR parsing table. The parser must also determine the “best” parse from among the different parsable subsets of an input. We thus designed a general framework for combining a collection of parse evaluation measures into an integrated heuristic for evaluating and ranking the parses produced by the GLR* parser. This framework was applied to a set of four parse scoring measures developed for the JANUS scheduling domain and the ATIS domain. We added a parse quality heuristic that allows the parser to self-judge the quality of the parse chosen as best and to detect cases in which important information is likely to have been skipped.
To demonstrate its suitability for parsing spontaneous speech, the GLR* parser was integrated into the JANUS speech translation system. Our evaluations on both transcribed and speech-recognized input have indicated that the version of the system that uses GLR* produces between 15% and 30% more acceptable translations than a corresponding version that uses the original non-robust GLR parser. We also developed a version of GLR* that is suitable for parsing word lattices produced by the speech recognizer, and investigated how lattice parsing can potentially overcome errors of the speech recognizer and further improve end-to-end performance of the speech translation system.
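The idea of searching for the maximal parsable subset of the input can be illustrated with a small sketch. This is not the GLR* algorithm itself: GLR* integrates word skipping into the LR parser, whereas the code below simply enumerates subsets by brute force. The toy grammar (`toy_accepts`), the function name `max_parsable_subset`, and the `max_skips` cap (a crude stand-in for the beam heuristic) are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical toy grammar: S -> "i" "am" "free" "on" DAY
DAYS = {"monday", "tuesday", "friday"}


def toy_accepts(tokens: Sequence[str]) -> bool:
    """Return True if the token sequence is covered by the toy grammar."""
    return (
        len(tokens) == 5
        and list(tokens[:4]) == ["i", "am", "free", "on"]
        and tokens[4] in DAYS
    )


def max_parsable_subset(
    words: Sequence[str],
    accepts: Callable[[Sequence[str]], bool],
    max_skips: int,
) -> Optional[Tuple[List[str], List[str]]]:
    """Find a largest in-order subset of `words` that `accepts` covers.

    Subsets are tried in order of increasing number of skipped words, so the
    first hit skips as few words as possible. `max_skips` bounds the search
    (an assumed simplification of a beam). Returns (kept, skipped) or None.
    """
    n = len(words)
    for k in range(min(max_skips, n) + 1):
        for skipped_idx in combinations(range(n), k):
            skip = set(skipped_idx)
            kept = [w for i, w in enumerate(words) if i not in skip]
            if accepts(kept):
                return kept, [words[i] for i in skipped_idx]
    return None


if __name__ == "__main__":
    utterance = "um i am uh free on on monday".split()
    print(max_parsable_subset(utterance, toy_accepts, max_skips=3))
```

Trying subsets in order of increasing skip count means the first accepted subset is maximal. The exhaustive enumeration is exponential in the worst case, which is the kind of cost a search-limiting heuristic such as the beam described above is meant to avoid.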
Similar resources
GLR*: A Robust Parser for Spontaneously Spoken Language
This paper describes GLR*, a parsing system based on Tomita's Generalized LR parsing algorithm, that was designed to be robust to two particular types of extra-grammaticality: noise in the input, and limited grammar coverage. GLR* attempts to overcome these forms of extra-grammaticality by ignoring the unparsable words and fragments and conducting a search for the maximal subset of the original...
JANUS: a Multi-lingual Speech-to-speech Translation System for Spontaneously Spoken Language in a Limited Domain
Janus is a multilingual speech translation system currently operating in the domain of meeting scheduling. Translating spontaneous speech requires a high degree of robustness to overcome the disfluencies of spoken language as well as errors in speech recognition. In this system description, we focus on the robust speech translation components in Janus: the skipping GLR* parser, the segmentation o...
UGLR Parser for Phrase Structure Languages as an Extension of GLR Parser
This paper proposes the UGLR parser as an extension of the GLR parser. A UGLR parser is powerful enough to parse deterministically any phrase structure language if it is in the class of recursive languages and can parse any context free language as fast as the conventional GLR parser. Natural language processing often requires a parser for languages belonging to classes larger than that of cont...
PROFER: predictive, robust finite-state parsing for spoken language
The natural language processing component of a speech understanding system is commonly a robust, semantic parser, implemented as either a chart-based transition network, or as a generalized left-right (GLR) parser. In contrast, we are developing a robust, semantic parser that is a single, predictive finite-state machine. Our approach is motivated by our belief that such a finite-state parser can ult...
Design of a semantic parser with support to ellipsis resolution in a Chinese spoken language dialogue system
In this paper, a semantic parser with support to ellipsis resolution in a Chinese spoken language dialogue system is proposed. The grammar and parsing strategy of this parser are designed to address the characteristics of spoken language and to support the ellipsis resolution. Namely, it parses the user utterance with a domain-specific semantic grammar based on a template-filling approach. Synta...